Forecasting patient flows with pandemic induced concept drift using explainable machine learning

Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables like Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, as well as the COVID-19 Alert Level indicators can both generally improve the forecasting models of patient flows and effectively adapt the models to the unfolding disruptions of pandemic conditions. This research also uniquely contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate more deeply the internal mechanics of the models than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.


Introduction
Urgent Care Clinics (UCCs) and Emergency Departments (EDs) are frontline healthcare providers, provisioning continuous care for a range of acute medical presentations. For a significant proportion of patients, UCCs and EDs are the initial pathways en route towards subsequent healthcare services and are thus prone to significant periodic congestion. The effects of delayed treatments at EDs due to overcrowding have been well investigated and research shows that these delays have been associated with negative clinical outcomes [42]. Although UCCs typically cater to less acute medical conditions, studies have also noted that insufficient staff availability at these facilities could lead to serious conditions being overlooked [2] in the presence of overcrowding.
Therefore it is important to implement strategies which assist with planning and the effective allocation of human resources to manage patient flows. Forecasting patient arrivals is one such strategy which can enable a greater optimisation of medical-service resourcing at facilities situated beyond primary care providers. Recently, there has been an increased academic interest in developing forecasting models for solving this problem. Forecasting of this kind can operate at different levels of granularity, such as estimating total daily patient volumes or even higher frequencies, like hourly patient arrivals. The purpose of this study is to develop daily patient-flow forecasting models using machine learning for two large UCCs providing services to patients experiencing sudden illness or accident-related injury.
Forecasting patient arrival volumes to a level of accuracy which makes the models useful is non-trivial due to numerous influencing factors as well as underlying stochastic processes. Producing reliable patient flow estimates has also been further complicated recently with the outbreak of the COVID-19 pandemic which has contributed an additional level of complexity due to the disruption to usual patient flow patterns [25,53]. These dislocations from otherwise stationary patterns in the dependent variable are termed concept drift. This phenomenon is encountered when the variable being forecasted no longer corresponds to the original data used for creating the model, ultimately resulting in the deterioration of accuracies. While these drifts can be sudden or gradual, ultimately a successful adaptive strategy depends on the ability to formulate an approach that captures the generative factors of the drifts as much as possible [43]. To date, no studies have yet emerged providing an analysis of how the concept drift arising from the COVID-19 pandemic and the resulting lockdown mandates have affected the patient-flow forecasting models, nor how the mitigation strategies can help adapt the forecasting models to increased volatility.
The current body of literature has primarily focused on forecasting solutions for EDs, with UCCs being largely overlooked; though the same drivers of patient demand and challenges apply to both UCCs and EDs. Most of the studies have concentrated on estimating total daily patient arrivals for the following day [5,21,32,39,49,51,53]. A few studies have investigated the feasibility of generating accurate forecasts as far as 30 days ahead [7,29] and some 90 days ahead [28], which entails an even greater level of uncertainty and challenge due to compounding errors. For the most part, earlier studies relied on standard statistical methods for this type of modelling [2,5,9,26]. Over time, machine learning approaches emerged [51,53], including deep learning techniques [21,39,42]. As the sophistication of the underlying algorithms has increased motivated by the attainment of high accuracies, the side effect has been a decrease in the interpretability of models. Little emphasis has been placed on clarifying the internal mechanics of the models. Examples of studies eliciting explanations of the models' reasoning underpinning their forecasts do not yet exist. Traditional time-series and simple regression methods have produced reasonable accuracies in previous studies. These approaches are still used and they continue to produce competitive results for forecasting patient flows. Research has shown that these techniques have excelled particularly in cases when the underlying data possessed consistent variations; however, it has been noticed that their accuracies become compromised when sporadic fluctuations are encountered [21]. Approaches that combine the advan-tages of traditional approaches with those of machine learning have not yet emerged in this domain.
The features used for the forecasting models in the prior works have chiefly relied on autoregressive (previous or lagged values) and various calendar variables [5,26,48] with some studies using variables indicating school holidays as well. There has been a recent trend to also incorporate weather-related variables [7,37,42,53], while air quality and pollution information have been a subject of some studies [32,40].
The aim of this study is to present a machine learning solution to forecasting total daily patient flows 7-days ahead in order to facilitate resource planning strategies at UCC facilities. This work particularly aims to investigate the feasibility of a range of real-time proxy variables to contribute to improving the overall model accuracies, as well as to employ a range of eXplainable Artificial Intelligence (XAI) techniques to both bring interpretability into the model behaviour and to extract additional insights. Additionally, this research seeks to develop mechanisms for enabling the models to adapt to the disruptive conditions of the recent COVID-19 pandemic, with learnings that can be applied to similar scenarios in future.

Contribution
The principal contribution of this study lies in the integration of novel, near real-time 1 proxy features aimed at maintaining the precision of patient-flow forecasts during periods of instability, such as the COVID-19 pandemic. The value of these proxy variables within the broader context of concept drift is highlighted as a crucial takeaway for practitioners. Additionally, the study also makes secondary contributions through the utilisation of previously unexplored machine learning models and the large number of algorithms used in experiments. A further contribution of this study is the combination of traditional and machine learning approaches in order to leverage the advantages of both. Furthermore, the study also presents the introduction of XAI techniques for the inspection of models' behaviour in this domain, providing an understanding of both the high-level characteristics of the models and their reasoning for individual forecasts.

Related work
The vast majority of prior research in forecasting patient flows has focused on EDs. Prior studies [10,13,48] have used patient flow forecasting and analysis literature from EDs and UCCs interchangeably and thus we follow the same pattern with the understanding that approaches that have been effective in the ED domain are transferable to the UCC context, and vice versa since the underlying mechanisms of demand overlap. Indeed, in their systematic literature review, Coster et al. [10] identify six key drivers of patient flow demand common to both EDs and UCCs. These included limited access to or confidence in primary care, patient-perceived urgency, convenience, recommendations by family and friends, or other healthcare professionals, and the belief that their condition required the resources and facilities offered by a particular healthcare provider. While theoretically valuable, none of these variables are generally available for use in forecasting patient demand. He et al. [22] also add to this list, population growth and ageing which carry a greater utility for forecasting long-term patient demand trends. More general causal factors for UCCs patient demand are discussed by DeLurgio et al. [13], where the authors list increasing trends in certain diseases like the prevalence of diabetes, seasonal influences like holidays as well as cyclical and irregular factors like flu epidemics and natural disasters, which also broadly align with those EDs [16]. We review published literature which is most comparable to our work in terms of studies that have predominately investigated the forecasting of total daily volumes of patient arrivals. We also focused on studies which reported their forecasting accuracies in terms of the Mean Absolute Percentage Error (defined in Equation (1)) which to some degree enables comparisons across different studies. Forecast horizons in terms of the number of days ahead being estimated are relevant for this study, and these are also highlighted and summarised where possible in Table 1.
Traditional regression and autoregressive modelling dominated the earliest research into forecasting patient demand at EDs. Batal et al. [2] successfully used multiple linear regression (MLR) to predict patient flows by incorporating a selection of calendar variables like the day of the week, month, season, and holiday flags. Jones et al. [26] subsequently used the same set of variables together with MLR as a benchmark model against comparisons with autoregressive ARIMA and SARIMA models for predicting daily patient presentations. Their work fundamentally differed in that they extended their forecast horizon to 30 days ahead. Meanwhile, Boyle et al. [5] focused on next-day ED patient flow forecasting, while using MLR, ARIMA and Exponential Smoothing (ES) models. In contrast, both Champion et al. [9] and Aboagye-Sarfo et al. [1] altered the frequency from daily, to predicting the total monthly patient demand. Both studies relied on ES with ARIMA modelling methods, while Aboagye-Sarfo et al. [1] also integrated these algorithms with experiments using VARMA.
The focus of subsequent works explored the viability of forecasting at differing frequencies together with longer forecasting horizons. One-step-ahead hourly, daily and monthly ED patient flow forecasting models with the use of calendar information (day of week, month and holidays) were developed by Boyle et al. [5]. Their work also experimented with standard approaches like ES, ARIMA and MLR, concluding that the errors increased and thus the forecasting became less reliable as the forecast granularity became finer. Marcilio et al. [29] returned to forecasting daily patient flows, exploring 7 and 30-day ahead predictions with the help of calendar as well as climatic variables. They expanded the range of techniques used to Generalized Linear Models (GLM), Generalized Estimating Equations (GEE) and Seasonal ARIMA (SARIMA). Whilst recognising the importance of forecasting further out into the future, Calegari et al. [7] addressed predicting total daily ED patient flows over 1, 7, 14, 21 and 30-days ahead, while also relying on the calendar (day of week, month and holidays) and climatic data. They also used standard SARIMA as in previous studies, but also included Seasonal ES (SES), Seasonal Multivariate Holt Winter's ES (HWES) and Multivariate SARIMA. Xu et al. [52] proposed a method that combined ARIMA and Linear Regression (ARIMA-LR) to predict daily ED patient flows for both the next day as well as 7-days ahead forecasting. Given the effectiveness of traditional approaches in this domain, more recent studies have continued to leverage these techniques like Carvalho-Silva et al. [8], who experimented with forecasting at different levels of granularity. They forecasted total patient flows on a next-week and next-month basis using ARIMA and ES, while Whitt and Zhang [49] forecasted next-day total patient flows using SARIMAX using also calendar and climatic variables.
Though traditional time-series and regression approaches possess more assumptions and stricter requirements that are harder to satisfy, they do continue to generate effective solutions for forecasting patient flows. In contrast, non-parametric machine learning approaches avoid many of the rigid assumptions required by traditional statistical methods and therefore have some added flexibility accompanied by a proclivity to overfit. Over the past decade machine learning solutions in this domain have started to emerge and have consistently produced competitive accuracies, while also exceeding those of traditional approaches in some cases [39,51]. Machine learning methods generally have the ability to capture non-linear relationships in the data and thus hold the potential to produce models which arguably better represent the complex and dynamic nature of patterns in this domain [53].
Xu et al. [51] was one of the earliest works using machine learning in this domain where they applied Artificial Neural Networks (ANN) in conjunction with variables describing seasonal influenza epidemics, as well as calendar and climatic data in order to forecast oneday ahead total daily patient flows. We see the emergence of ensemble-based techniques like Random Forest (RF) and Gradient Boosting Machines (GBM), together with ANN for forecasting daily respiratory and circulatory-related ED admissions in work by Navares et al. [32]. In this study, the authors incorporated environmental and bio-meteorological variables into the models that captured air quality indicators, and ultimately concluded that machine learning methods did outperform ARIMA when the models were combined using Stacking.
We are now beginning to see an emergence of a broader range of proxy variables being used, together with deep learning approaches and some early signs of XAI techniques entering this domain. Vollmer et al. [47] proposed patient flow models forecasting 1, 3 and 7-days ahead while also using RF and GBM, together with a k-Nearest Neighbours (kNN) regressor and stacked generalisation combining machine learning and statistical approaches. They expanded the range of proxy variables used in their models and included weather, school and public holidays, and seasonal pattern data, alongside large scheduled events which tended to see an influx of patients. Additionally, they also utilised Google search data for the keyword "flu". Long Short-Term Memory (LSTM) was used by Rocha and Rodrigues [39] where they performed a wide range of comparative experiments with those of traditional ES and SARIMA approaches and other machine techniques like Autoregressive Neural Networks (AR-NN) and XGBoost (incorporating calendar variables like day of week, month and holidays). The authors concluded that LSTM produced the best accuracies for next-day forecasts of total daily patient flows. Sudarshan et al. [42] likewise used LSTM for predicting 3 and 7-day ahead total daily patient flows. They included Convolutional Neural Networks (CNN) into their suite of algorithms and compared the results against RF. The authors also confirmed the effectiveness of deep learning methods in this domain by demonstrating that LSTM was most accurate for the 7-day forecasting, while CNN was more suited for 3-day ahead forecasting in their setting.
Harrou et al. [21] expanded the range of deep learning techniques in their work by including Deep Belief Networks (DBN), Restricted Boltzmann machines (RBM), Gated Recurrent Unit (GRU), a combination of GRU and Convolutional Neural Networks (CNN-GRU), CNN-LSTM, as well as the Generative Adversarial Network based on Recurrent Neural Networks (GAN-RNN). They designed their models to predict total daily patient flows one-day ahead, finding that the best accuracy was achieved using DBN. However, research also demonstrated that deep learning methods do not always deliver the best accuracies.
The study by Zhang et al. [53], contrasted LSTM with standard machine learning methods like kNN, Support Vector Regression (SVR), XGBoost, RF, AdaBoost, GBM and Bagging, in addition to traditional methods like OLS and Ridge Regression. This study found that it was SVR which outperformed all other algorithms based on models predicting nextday patient flows while using calendar and meteorological variables.
Ensemble-based techniques leveraging GBM such as XGBoost and LightGBM continue to increase in prominence. Guyeux and Bahi [20] used both algorithms together with RF and Lasso for hourly forecasting of patient flows. Similar to our study, the authors expanded the use of variables for forecasting beyond using standard calendar and meteorological variables. Instead, the study augmented these variables with road traffic information and epidemiological data reporting the current number of cases of influenza, chicken pox and acute diarrhoea. All in all, the study used 712 explanatory variables for forecasting and concluded that XGBoost was best performing in their setting. Meanwhile, another recent study bearing parallels with our research was conducted by Petsis et al. [37], who focused on predicting daily patient flows 1 and 2 time periods ahead. This work is reported to be the first study that has started to incorporate XAI techniques into illuminating the behaviour of the underlying models. This work primarily considered the high-level interpretability of the models which included the analysis of feature interactions, but the reasoning behind the forecasts for specific days was not considered. The study again used Gradient Boosting in the form of XGboost, relying on standard calendar and weather variables which amounted to a total of 38 variables.

Summary
It is natural to see that traditional techniques for forecasting patient flows were the most prominent approaches in earlier research. The emerging trend within the existing literature points in the direction of increasing use of machine learning algorithms, particularly those with more inherent complexity. Though machine learning approaches are achieving a widespread uptake, traditional methods continue to be used as well and generate competitive accuracies; however, they are now more often used as benchmarking models. This can be explained by the fact that machine learning algorithms tend to be a preferable choice over traditional statistical methods in forecasting, as they can handle large and complex data, adapt to changing environments and more easily handle missing and high dimensionality data which can be seen in features increasing over time (Fig. 1a). 2 Thus far, the literature indicates that amongst the researchers autoregressive, calendar and holiday variables, together with weather and bio-meteorological features remain popular model inputs for forecasting. Some recent investigations into the usefulness of the proxy variables like internet search terms and live influenza tracking indicators are emerging. Due to this and an increasing number of features being used, the suitability of non-parametric algorithms becomes evident. Traditional approaches possess advantages which can still be leveraged. However, studies predominately focus on exploring either traditional or machine learning approaches instead of attempting to combine both into a single model that is able to leverage the advantages that both techniques offer. To that end, this study seeks to fill this gap. Additionally, the literature shows that there is an emergence of an ever more diverse range of machine algorithms being used in this domain; however, there is no clear algorithm which stands out as the best-performing.
Achieving high accuracy in forecasting patient flows has been a central focus of most prior studies. Figure 1b depicts the trend in the reported forecasting accuracy over time. The figure reveals a gentle downward slope indicating a gradual reduction in error (reported as the Mean Absolute Percentage Error-MAPE) values over time. 3 It should be noted that only a single study used datasets covering the COVID-19 period which exhibited sudden and recurring concept drifts in the underlying data patterns, which result in deteriorations of accuracy. Given this gap, this study aims to explore the effectiveness of a larger set of novel proxy variables to mitigate the negative effects of concept drifts, such as the most recent one during the pandemic period. Finally, the increased usage of more complex machine learning algorithms is accompanied by the generation of uninterpretable "black box" models. A clear gap in the literature for patient flow forecasting exists in the absence of analyses which consider both the highlevel mechanics of model behaviour as well as the interrogation of the models' reasoning into specific forecasts. No study as yet using machine learning has explored the internals of the models to the fullest extent, which this study attempts to address.
In view of the existing literature, the research questions (RQ) addressed in this study are: • (RQ1) Which machine learning models attain the highest improvements over the benchmark strategies for short-term forecasting of daily patient arrivals at UCCs? Can machine learning and traditional statistical approaches be combined to produce higher accuracies? • (RQ2) Can proxy variables improve forecast accuracies and adapt the models to the concept drift caused by the COVID-19 conditions? Which variables are most effective at generally improving the forecasting accuracies of daily patient demand seven days ahead? • (RQ3) How can greater interpretability and explainability of forecasting models be achieved?

Setting
The data were sourced from Shorecare 4 which owns and operates the clinics in this study located in Auckland, New Zealand. The Smales Farm clinic provides 24-hour care services, while the Northcross clinic is reduced to after-hours care. The clinics treat standard low acuity cases and provide x-ray and fracture clinics as well as facilities for complex wound management. The Smales Farm clinic is the sole 24-hour UCC servicing a population of approximately one-quarter of a million and is situated within one kilometre of a major hospital whose ED treats ∼46,000 patients annually.

Patient flow dataset
Models were designed for predicting daily arrivals seven days ahead. The dataset recorded patient presentations spanning 11 years from 2011 through to 2022. Figure 2 shows the characteristics of the patient flows on a selection of the dataset for both clinics. Seasonal patterns indicating increases in patient arrivals are visible during the winter months of the Southern Hemisphere (June-August). This can be explained by increased presentations of Influenza and other respiratory-related illnesses. The beginning of the period affected by strong concept drifts is highlighted and deviations from the prior patterns can be observed. Additionally, the exact points corresponding to the most stringent mandates concerning COVID-19 pandemic lockdowns are highlighted, as well as periodic partial closures of the smaller clinic during portions of this period. An aggregation of the data by week number across all years can be seen in Fig. 3. There is generally a reduction in demand during the school holidays, with peaks in patient arrivals coinciding with the Winter months (mid-year), and the end of the year when the local general practitioners (GPs) are closed for holidays.  Figure 4 shows average patient arrival patterns across both clinics by day of the week.
A peak in patient presentations tends to occur during the weekend when GPs are closed, which is followed by a decrease in patient flows at the start of the working week, which then reaches minima by mid-week.

Proxy data and features
The models use a mixture of autoregressive and proxy features. Since the data exhibits pronounced seasonal and weekly cyclical patterns, we used autoregressive features representing previous patient arrival values from 7, 14, 364, 728 and 1092 days before as predictors. Additionally, the week number in the year was used to provide information describing the seasonal trends as well as recurring school holiday patterns. A public holiday flag was devised which ensured that the elevated demand driven by these days was represented.
A strategy was devised to handle the pandemic-related concept drift by using a set of variables which had the ability to capture the changes in the underlying conditions. Primarily, we created a variable that represented the severity of the legally mandated COVID-19 restrictions 5 prevailing within the clinics' locality.
Other variables were used with the assumption that they would capture both the signal indicating that concept drift was occurring, and that they would also contribute to the general improvement of the overall models by being effective proxies for patient demand. In order to qualify for consideration and be useful for real-time forecasting, these variables had to satisfy three constraints. The first being that the data needed to be measured in realtime and thus be up-to-date, secondly, the data had to be publicly available in real-time, and finally a sufficient amount of historic values of this data had to be available for model training.
Google Trends [18] data provides information on the frequency of Google search keywords over a given time period and region and satisfies the above constraints. We used this data as a proxy variable with the assumption that it could capture potential increases in various symptoms which may trigger higher patient arrivals. The selected keywords were flu, headache, sick, chest pain and cough being both generic as well as relevant to the COVID-19 pandemic. Google Trends data is accessible at weekly aggregate results provided that the search criteria fall within 5 years.
We considered the possibility of including pedestrian foot traffic as an additional proxy variable in our models, based on the underlying assumption that higher levels of pedestrian activity would indicate higher public activity levels and, therefore, an increase in Autoregressive 14-day lag, the value from two weeks prior lag1 Autoregressive 364-day lag, the value from one year prior lag2 Autoregressive 728-day lag, the value from two years prior lag3 Autoregressive 1092-day lag, the value from three years prior public_holiday Public Holiday Indicator (0/1) week Week Similar to previous studies, we also integrated information on the prevailing weather conditions. We sourced the data using APIs from Visual Crossings [46]. Our exploratory analysis indicated that the most effective feature from the suite of possible weather variables was the 'feels like' indicator that combines temperature, wind chill and heat index values.
Guided by the advice of domain knowledge of the UCC staff, we also incorporated the data from FluTracker [17]. FluTracker monitors prevailing levels of Influenza in New Zealand using surveys. Approximately 30,000 people report weekly symptoms of fever or cough being experienced. The New Zealand FluTracker data is available only from 2018 onwards, with weekly reports provided online. The data also has missing values which coincide with the summer months when the prevalence of the illness is usually low. The data is presented as the percentage of respondents exhibiting symptoms each week. Table 2 lists the names of all the variables used in various figures together with their descriptions. Appendix B shows end example of the proxy feature values versus patient flows in Fig. 11 to Fig. 14.

Benchmark models
We evaluated the efficacy of the proposed models against several benchmark models. The primary benchmark model replicated the current in-house strategy employed by the clinics' administrators to estimate patient arrivals. This approach used the patient arrival numbers from the same period in the previous year plus an additional 5% to account for an increasing trend in total volumes. The second benchmark model is a Persistence Model. It is essentially a Random Walk technique [35] which makes the forecasted value the same as that of the identical day in the previous year. The third benchmark model was ARIMA [4], created with auto-tuning. The final benchmark model consisted of an enhanced version of the Persistence Model which forecasted a value for a given day at a point in time t to be a mean of weighted autoregressive features consisting of values from t-7, t-14, t-364, t-728 and t-1092 days.

Testing approach
The models were tested on data covering five and a half years from 2017 through to mid-2022. A modified version of the expanding window approach was used for estimating the generalisability of the models. Up to a maximum of 5 previous years of training data were used for creating each model. 6 Following each training phase, the models were then tested on forecasting 7 days ahead. There were in total of 286 test sets in this hold-out approach. The models were initially trained on data from 2014-2016 7 and the forecasts were then made starting from 1 January 2017 up to 7 days ahead. Figure 5 visually depicts our training/testing methodology. The whole testing process was performed three times; once for models containing all the proposed features. The second time for models using only autoregressive features to establish the efficacy of the proxy features. Finally, the models were trained using only proxy features in order to both assess their utility and extract additional insights.

Model error measurements
The models were evaluated using several metrics with each one providing a slightly different perspective.
Mean Absolute Percentage Error (MAPE) is frequently used in literature and is recommended as the primary evaluation metric for forecasts [3]. We therefore followed this recommendation. MAPE is scale-independent and can be used to compare forecasts across datasets and studies with different ranges of values for the dependent variable. The calculation of MAPE is as follows: where T is the number of forecasts under evaluation andŷ ty t is the error or residual term arising from the difference between the observed y value and the forecast valueŷ t at time point t. MAPE is a useful measure because it is able to express deviations between the observation and the forecasted values in terms of percentages, and as such, it is easy to interpret. A non-parametric algorithm that bases its predictions on the principle of proximity, producing a forecast that is an aggregation of k nearest observations with respect to the characteristics of the data point in question.

Ridge Regression
A technique that creates a parsimonious model which shrinks the coefficients towards zero using L2 regularization. The resulting models generally reduce the variance resulting in an improved mean-squared error. Support Vector Machines Regression SVR is an extension of SVMs. It uses a pre-defined kernel function to transform the data from a non-linear space to a higher dimension in order to find an approximate fit that satisfies a pre-determined error margin. To that end, the objective function of SVR is to reduce the coefficients rather than the error term (epsilon). SVRs are particularly effective on smaller datasets and are more robust to outliers.

Kernel Ridge Regression
Kernel ridge regression extends Ridge Regression with the integration of the kernel trick technique from SVR. It differs to SVR in that it uses the squared error loss as opposed to the epsilon-insensitive loss in SVR, combined with l2 regularization. Prophet Auto-tunable, additive forecasting model with the ability to handle non-linear trends using yearly, weekly, and daily seasonality with capabilities to integrate effects from holidays, having robustness to dislocations in trend. Random Forest Regression Ensemble-based algorithm consisting of decision trees whose outputs are combined. Each decision tree is induced based on random feature subsets, resulting in an uncorrelated forest of trees. The combined accuracy of the forest results in a higher fidelity than that of any individual tree. CatBoost CatBoost is an ensemble-based algorithm that generates gradient-boosted decision trees. During training, successive trees are induced with a reduction in loss. The size of the ensemble is preset by defining the maximum number of trees as a parameter.

Voting Regressor
Ensemble-based meta-estimator. Combines machine learning and traditional time-series approaches. Initially generates models for the underlying base regressors: Prophet, CatBoost, Random Forest and ARIMA. It then combines the outputs of these algorithms for the final forecast using a weighted combination scheme.

Averaging Model
This algorithm was a customised version of the Voting Regressor which combined the outputs of five algorithms (Prophet, CatBoost, Random Forest, Voting and Stacking) but discarded the highest and lowest predictions in the calculation of each prediction. Stacking An ensemble-based meta-estimator which models the forecast outputs of the underlying base estimators (Prophet, CatBoost and Random Forest) using an overarching regressor whose output constitutes the final forecast.
We also note the Root Mean Square Error (RMSE) for each model as defined below: RMSE is instructive since it describes the dispersion of the errors while being scaled to the dependent variable, therefore smaller RMSE values are preferred.

Figure 5
The expanding window testing approach used in this study. Each testing window represents a one-week period. In total 286 training and testing cycles were performed for each model covering the periods between 2017 and July 2022 For completeness, we also report the Mean Absolute Error for all models, which in conjunction with the previous two metrics is also used by some more recent studies in this field [53]. MAE is the average absolute difference between the observation and the forecasted values. Given this property, it is to some degree conceptually easier to interpret and due to the squaring of the differences, it is less sensitive to large errors, unlike RMSE. Therefore, several significant errors will influence RMSE to a larger extent than MAE. The calculation for MAE is: Lastly, we extensively use mean ranks in order to concisely summarise the performance of all the algorithms across every test dataset. For each forecast period of 7-days, every algorithm was ranked from 1 to 15 with respect to its MAPE value, with the best performing algorithm achieving the rank of 1. This was performed across all 52 testing periods per year and across each of the 5.5 years of testing, and from this, the mean ranking was calculated.

Statistical measurements
In addition to the evaluation metrics, we also use Theil's U statistic [45] in order to assess the model accuracy relative to the persistence model where the forecast value is equal to the previous value. Since the method squares the errors, it gives more weight to large deviations and exaggerated them, which can serve as a useful method for identifying suboptimal models. The Theil's U statistic value is calculated below as: where y again is the observed value andŷ t is the forecast value at a given time step t. When interpreting this statistic, values lower than 1 indicate that the model is performing better than the persistence model, while values of 1 and beyond indicate that the forecast accuracy is equivalent to the persistence model and is in fact worse as the values increase. Finally, we report the Diebold and Mariano [14] statistical test to establish whether the sequences of forecasts of the models are meaningfully different from those of the benchmark. In order to determine this, we use this test to compare the outputs of the competing model with those of the benchmark estimation models, with respect to the observed values.

Model interpretability and explainability
The emerging field of XAI, spurred on by increasing regulatory requirements [30], addresses the challenge posed by uninterpretable models and attempts to answer both the "how" and the "why" of their decision-making in approximate terms. A set of approaches called post-modelling explainability tools, aim to answer "how" an algorithm behaves in the construction of a model during the training process, resulting in model interpretability, as well as "why" the generated model has made a specific prediction/forecast, resulting in model explainability [31]. One of the techniques that currently stands out as state-ofthe-art for extracting the interpretability and explainability of predictive models [19], is SHAP [27], which is used in this study.
We employ SHAP to examine the internal mechanics of the predictive models at both the global and local levels. At a global level of analysis, we seek to understand the overall effects that each feature exerts on the model outputs. We primarily use feature importance plots to gain this insight which typically ranks as well as depicts the relative impacts of each feature. We also examine the effects that changing feature values have on the final forecast. Additionally we use feature dependence plots in order to shed light on how pairs of features interact in order to affect the final forecast. These plots together offer a degree of high-level interpretability of the main drivers for a given model. In considering model behaviour at a local level, we attempt to extract a model's reasoning as to precisely why exactly it has produced a given forecast for a specific data point. SHAP generates new models which approximate the forecasting behaviour of the underlying "black-box" models. These models are called surrogate models and are designed to be more interpretable. SHAP (an abbreviation for SHapley Additive exPlanation) itself is based on Shapley values [41]. The technique operates on the principles of game theory. It attributes each feature's marginal contribution to the final predictive outcome in collaboration with the other features. In this way, SHAP is able to provide both global interpretability and local explainability.

Results
The results are presented in two parts. The first examines the accuracies and the statistical significance of the forecasting models, together with the efficacy of the proxy features. The effects of the concept drift on the model accuracies are highlighted from the year 2020 onward. The second part analyses both the interpretability and the explainability aspects of the models, with a focus on determining the utility of the proxy variables. The second part also attempts to extract insights concerning the model behaviour and the underlying features. Table 4 shows a high-level summary of all the models across both clinics, displaying the MAPE values for each candidate model developed with a full set of features. To establish the utility of the proxy features, each model's MAPE score is contrasted with the MAPE values of models developed using only autoregressive features, which are placed adjacently in parentheses. Both the RMSE and MAE values are also listed, together with the average rank scores based on MAPE. The table indicates that across both clinics the proposed models have outperformed those that have not used proxy features. At the bottom of the table, it can be seen that the best-performing algorithm was Voting which combined machine learning and standard statistical approaches.

Forecast modelling
The table combines the rank-ordering of the modelling techniques from worst performing to most accurate with respect to MAPE across both clinics (seen in the final column of Table 4). The results indicate that the Voting algorithm has consistently outperformed all other techniques on this dataset by leveraging the advantages of both machine learning and statistical approaches. The top three performing techniques are all ensemble-based meta approaches which have in various ways combined the models from the underlying algorithms. These approaches were closely followed by standard ensemble-based methods, with Gradient Boosting-based approaches outperforming Random Forest. The rankordered results also indicate that the existing in-house benchmark approach to patient forecasting has been significantly improved upon by the best-performing methods.
It is also observable through MAPE scores that the predictability of Clinic 2 patient flows is generally lower than that of Clinic 1. This is attributable to overall lower patient flows at Clinic 2 which predisposes it to more variability and thus higher unpredictability. This result underscores the limitations of blindly using the MAPE measure for comparisons across different studies without taking volumes into consideration since the magnitude of the total patient volumes affects variability and consequently, predictability. In saying that, comparative studies from reported literature cite MAPE accuracies ranging from 7.6% to 10.7% for 7-day forecasts (Table 1) which provides some context for the 9% and   13.1% accuracies achieved by Clinics 1 and 2 respectively, taking into account that the prior studies did not cover the pandemic period.
To highlight the effects of the pandemic-induced concept drift, Tables 5 and 6 examine the accuracies of the models by each year, across both clinics. Firstly, the tables confirm that the Voting method has consistently generated higher generalisability across all years according to the combined mean rank. However, an abrupt pandemic-induced deterioration in the accuracies can be seen for 2020 onward on both tables. The accuracy declined by 55% and 63% for the Voting method across Clinic 1 and 2 respectively from 2019 to 2020, before beginning to improve from 2021 onward. The subsequent improvement in the accuracies after 2020 indicates that the models made some measure of adjustment to the concept drift and have been able to adapt to the new conditions with the help of the proxy features. In comparison, the benchmark forecasts deteriorated by 156% and 66% across Clinic 1 and 2 respectively for the same period. To visually highlight the effects of the concept drift on the accuracies and the subsequent adaptations by the models, we render the MAPE values for both clinics, contrasting Voting and benchmark models in Fig. 6. Sharp deteriorations especially for the benchmark model can be seen in 2020 for both Figure 6 Comparison of MAPE accuracies between the benchmark and Voting models across both clinics and all the years clinics, together with adjustments which occur faster for the Voting model and thus incur a smaller deterioration in accuracy. Though the benchmark models eventually recover some of their accuracy, this occurs with a considerable delay.
Returning to the models' accuracies by year, the overall improvements of the best model (Voting) over the benchmark are ∼33% and ∼23% for Clinics 1 and 2 respectively. A detailed presentation of the percentage improvement achieved by the proposed models over the benchmark approach by year as well as overall can be seen in Tables 8 and 9 in Appendix A. The TheilU statistic comparing the Voting model to the benchmark showed that values are below 1, indicating an improvement over the benchmark. A full summary of these values can be seen in Tables 10 and 11 in Appendix A. Finally, in establishing the advantage of the Voting model over the benchmark, the Diebold-Mariano test across both clinics and for each year indicated that all Voting model forecasts can be considered significantly different (at a 1% level) to the benchmark approach.
Next, we depict an example of the Voting model's forecasting behaviour across three distinctive years, namely from 2019 to 2021, which cover both the stable (2019) and the concept-drift periods (2020-2021) in Fig. 7. This figure shows the forecasting results for estimating patient flows for Clinic 1 while contrasting them with the observed values. Relatively stable patient flows together with accurate forecasts can be seen for 2019, with several undetected patient surges occurring in that year (circled) where the discrepancy between the forecasts and the actuals was 20% or above. Significant COVID-19-related disruptions and the ensuing concept drift can be seen in the figures depicting 2020 and 2021 when recurring lockdowns and GP closures took place. In the figures, the forecasts largely demonstrate a high degree of adaptability to the underlying changes aided by the proposed proxy features which are able to incorporate new information. While the forecasts were generally effective, points of interest have been highlighted in the figures where the observed values exceeded the foretasted values by more than 20%.
To further establish the efficacy of the proxy features, Table 7 contrasts MAPE results of the benchmark and three versions of the Voting model (using only proxy-features, without proxy-features and using a full set of features). The table demonstrates an overall improvement in models using a full set of features that includes proxies features compared to those  without. The sharp difference between the models is particularly evident for the years 2020 and 2021, which supports the claim that these features are effective in mitigating the effects of concept drift. The results for 2018 and 2019, which are considered normal patterns, also show that models using proxy features perform better than those without. The results for 2017 are inconclusive, and for 2022, the models without proxy features performed better, but it should be noted that the data for 2022 is incomplete, covering only predictions up to July. 8 The contrast between the full-feature model and the one using proxy features only is of note since the latter demonstrates that it was able to account for a considerable amount of variation in the patient flows alone. During stable periods (2017-2019) the full-feature model improved over the proxy-only model between ∼18% and ∼31%. However, during the concept drift period (2020-2022), this was reduced to only a ∼7% to ∼16% improvement. The experiment indicates that a high degree of information is entailed within the proxy variables and that they are able to explain a significant degree of variation in the dependent variable, especially during the particular case of the recent pandemic-induced concept drift.

Model interpretability and explainability
We now use the SHAP technique in order to extract the Voting model's interpretability and the explainability of its forecasts for Clinic 1. Figures 8a and 8b  while values from two weeks prior and years are also prominent. However, it is clear that the prevailing COVID-19 Alert Level has gained a high-ranking position for importance in 2021. Generally, the importance of autoregressive features has been interpreted as more important by the models than that of the other proxy variables.
The SHAP summary plots in Figs. 8a and 8b communicate an additional dimension concerning global interpretability. In these figures, one can observe how an increase/decrease in the values of each underlying feature impacts the final forecast. The colours represent feature values where red equates to high values and blue to low values. The x-axis conveys the range of impact on the final forecast. The data points which have a positive SHAP value and appear to the right of the vertical zero line, have an impact on the forecast value which drives it towards predicting higher patient flow numbers. Conversely, the data points that have a negative SHAP value, to the left of the vertical zero line, decrease the forecasted patient flow numbers. As the points extend further from the zero vertical line, their effect and contribution to the final forecast correspondingly increase.
From Figs. 8a and 8b, a general pattern can be seen where the higher values of the features have a correlated positive effect on pushing the forecasts toward higher totals. In Fig. 8b, an exception can be seen in the effects of the COVID-19 Alert Level, where higher levels result in decreasing the forecasted patient flow. The impact of this feature as well as that of the public holidays is also asymmetrical. In other words, the model responds more strongly to forecasting lower patient demand if the values for the COVID-19 Alert Level are high, then it would predict a higher patient flow if the values for these features were low. The reverse holds for the public holiday feature. Some ambiguity exists in the effects that the values of 'feels like' , 'pedestrian count' as well as 'flu percent' have on the final prediction.
Figures 8c-f expose the explainability of the models for a selection of specific forecast dates. These figures depict the top nine features for each forecast and their values on the y-axis. The features are rank-ordered by their impact on the final prediction. The figure can be interpreted as a contest of forcing effects between all the features. The expected value of each figure is denoted as E[f (x)] which is the average value of all the forecasted data points. A final SHAP value at the top represents the eventual forecasted outcome. Blue bars represent the forcing effects towards smaller forecast patient flows, while red indicates the opposite. The size of the bars represents the effect size that each feature and its corresponding values exert. These graphs are best interpreted from the bottom up.
The forecasts for Figs. 8c and 8e were made in 2019 which explains why the COVID-19 Alert Level did not play a role in the forecasts. In Fig. 8c, the 7 and 14-day autoregressive features were the most impactful; however, their effects cancelled each other out, while Google search terms pushed the forecasts towards higher estimates. In contrast, Fig. 8e shows that the autoregressive value from two years before had a strong effect on higher forecasts, and pedestrian counts featured more prominently in pushing the final forecast down. Figures 8d and 8f represent the concept-drift phase. In these figures the strong influence of the COVID-19 Alert Level can be seen responding to the changes in the dependent variable, pushing both forecasts down due to the high prevailing pandemic situation. In both cases, the pandemic level eclipsed the remaining features and successfully rectified the final forecast toward an accurate estimation.
Here, we revisit the proxy-only models to study their effects and utility, which we conduct with the removal of the autoregressive features. Figures 9a-f depict the behaviour of these models. The same analysis approach is followed as previously by using data up to 2019 and 2021. Again, the first two Figs. 9a and 9b depict the interpretability of the models. In both years, it can be seen that pedestrian foot traffic is a significant driver of the forecasts being negatively correlated with the forecasted patient flows. Google trends and 'feels like' data similarly exert a positively correlated influence across both years; however, for 2021 the COVID-19 Alert Level becomes the second most important feature with a negatively correlated impact on the patient flows.
Figures 9c and 9e depict the explainability of the forecasts for two separate dates in 2019. Both explanatory figures demonstrate the strong influence of the prevailing influenza conditions as a strong variable for this period, while Google trends searches are the strongest for the former. For 2021, we observe that Google trends search results, as well as pedestrian foot traffic, are influential on both dates in Figs. 9d and 9f; however, the COVID-19 Alert Level is an overwhelmingly influential variable for the latter in driving the forecasts down due to the highest prevailing pandemic level of 4 occurring in Fig. 9f. The different magnitudes of effect that the COVID-19 Alert Level is explained by Fig. 9b, where we see that high alert levels exert a considerably larger effect on the forecast than lower alert levels exert on increasing the forecasts.
We conclude with a high-level analysis of the model's behaviour to draw out further insights, but this time considering how pairs of features interact in order to influence the final forecasts. We depict a selection of these interactions in Figs. 10a to 10d focusing primarily on proxy features. Each plot contains a horizontal line which indicates the threshold representing a transition from increasing and decreasing effects that pairs of features have on the final forecast Figure 10a shows how the effect on the final forecast changes as the COVID-19 Alert Level increases (x-axis) and as the value for Google Trends search terms rises (colour gradient on the right y-axis). The interaction of low COVID-19 Alert Level values and high Google Trends search terms exerts the strongest effect on raising the forecasts. The forecasts are significantly downgraded as the COVID-19 Alert Level values increase and the lockdown measures take effect but at the same time, some of the highest Google Trends search terms regarding respiratory terms are encountered during the highest lockdown conditions. 9 Several insights can be extracted from Fig. 10b depicting the interaction between the reported increase in influenza prevalence and the weather variable. Firstly, it can be observed that influenza levels between 0.5% and 1% in conjunction with high weather values tend to correspond with positive effects on the forecasts. This can be explained by summer conditions where there is an uptick in sport-related injuries and general outdoor accident-based cases. However, influenza levels below 0.5% tend to be inconsistent. As the influenza prevalence increases from 1% to 3%, this has a positive effect on forecasts as would be expected during the winter months, and can be confirmed by the fact that most of the weather-perception values are downward trending. An unexpected pattern can be seen as influenza prevalence levels increase beyond 3% where counter-intuitively a mixed, and more negative effect is exerted on the forecasts which is contrary to the reported experience of the clinic's staff. 10 Meanwhile, in Fig. 10c it is observable that as the pedestrian traffic increases, there is a generally a decreasing effect on the forecasts 11 until the highest pedestrian traffic values 9 The greatest values indeed coincided with the first and the highest COVID-19 Alert Level announced in March 2020. 10 The negative effect of high influenza prevalence may be due to uncertainty in the data itself which raises some questions about the reliability of this feature since it is incomplete with the data being available from 2018 onward. It may also be explained by the fact that this feature is a nationwide indicator and influenza incidence is likely to be different in colder regions of the country than in Auckland. The fact that the highest values of flu prevalence coincide with high weather values as negative drivers of forecasts may be suggestive that this period corresponds with the earlier beginning of spring in Auckland than in other regions, which leads to a sharp actual decline in respiratory related illness. 11 This is explainable by the fact that the pedestrian traffic peaks midweek when the patient arrivals are at their lowest. Meanwhile, the pedestrian traffic is lowest in the CBD during the weekend when patient demand is typically at its highest.
are reached, at which point, the behaviour of the models becomes more erratic, possibly due to public holidays and large events. The highest COVID-19 Alert Level values together with the ensuing lowest pedestrian traffic values, aggressively decrease the forecasts as expected. The highest positive effects on the forecasts are seen in data points where the pedestrian traffic is low but the COVID-19 Alert Level is less than 4 which corresponds with the most stringent lockdown mandates. This also corresponds with the loosening of the lockdown restrictions which in turn trigger increased clinic visits due to the pent-up demand during the highest lockdown conditions.
In the last illustration, Fig. 10d depicts the effects of increasing Google Trends search terms with respect to changes to weather conditions. We see that some of the highest Google Trends search values (from 80 onward) correspond with the strongest negative forecasting effects. We can correlate this result with Fig. 10 which shows that some of the highest Google Trends values occurred during the highest lockdown measures. A strong negative effect on the forecasts can be explained by the increase in the general anxiety levels of the homebound populace, resulting in an increase in Google searches covering terms related to respiratory illnesses, while at the same time discouraging the population from seeking non-urgent medical attention to lessen the risk of exposure as directed by the high covid alert levels. Therefore, it can be surmised that unusually high Google Trends search values from 80 onward are exceptional as they are attributable to severe lockdown conditions. Meanwhile, Google Trends search values ranging from 50 to 80 can be interpreted as belonging to the winter months and represent typical patterns that are confirmed by the low weather temperatures for this range. The effect of the summer months and the lowest Google Trends search values can also be observed where the strongest negative effects on the forecast values are exerted.

Discussion
A comprehensive empirical investigation was conducted using a broad set of models and features which achieved considerably improved accuracies over the benchmark approaches as well as competitive accuracies with respect to prior studies. In addressing the study's first research question (RQ1), we find that the ensemble-based Voting method consistently outperformed all other candidate algorithms for this setting. The approach successfully combined a mixture of both machine learning and statistical techniques in order to leverage the advantages of both. This finding aligns with the literature on ensemblebased machine learning which attests to the benefits of using such methods in comparison to utilizing single models. The theory of ensemble-based machine learning posits that by aggregating the predictions of diverse models, the resulting ensemble will on average exhibit superior performance across a wide range of datasets when compared to utilizing a single model.
The abrupt onset of the concept drift caused by the COVID-19 pandemic conditions degraded the forecasting accuracies initially. Nonetheless, the experiments indicated that the suite of novel real-time proxy variables were effective at adjusting the models to the new and evolving drivers of patient flows. Not only were the proposed features useful at triggering adaptations to the disruptions of historic patterns, but they were also effective at improving the model forecasts in general (RQ2). Indeed, the experiments showed that reasonable models could be generated using only proxy variables without reliance on autoregressive feature values. The study confirmed that the COVID-19 Alert Level feature was particularly effective at forecasting patient volumes during disruptive periods, while determining the effectiveness of the Google search terms, pedestrian foot traffic as well as weather data. While the reported prevalence of influenza prevalence was a useful feature and contributed to some degree in improving forecasts, it was less reliable than the other proxy features likely due to its incompleteness. The implications of this finding is that accurate forecasting of patient flows should remain possible using these features in the event of future pandemic outbreaks.
A particularly novel contribution of this study was the extensive use of XAI tools in order to expose the internals of the forecasting models. The use of the SHAP technique to achieve global interpretability of the models was highlighted and proved effective in depicting the various effects that features and their values bear on the final forecasts (RQ3). Additionally, at the local level, the technique demonstrated its ability to clearly explain the drivers of the forecasted values for individual days. Meanwhile, the dependence plots yielded extra insights into the interactions of various features and how they collectively influence the final forecasts. This also enabled the validation of the models against domain knowledge to take place.
The findings of this study contain practical implications for practitioners. While the results of this study are specific to the two clinics in question, replication of our approach to other clinics in different cities would require re-evaluation of the features used. However, the insights and findings of this study can inform practitioners about the potential utility of these features for generalisation and their potential responsiveness to concept drifts. Additionally, the study's results can inform practitioners about the potential performance and limitations of different machine learning algorithms when applied to patient flow prediction tasks. Our analysis approach is therefore generalisable to different contexts, and the demonstrated XAI tools can be utilised to evaluate and fine-tune models. We offer some recommendations to practitioners: • Consider the timeliness and granularity of data: Choose real-time proxy variables that are available in a timely manner and at the appropriate level of granularity for accurate forecasting, ensuring that there also exists a sufficient amount of historic values for model training. • Consider unique attributes of the clinic: When selecting real-time proxy variables for patient forecasting, consider the unique features of the clinic, such as its location, the local transportation system and nearby hospitals. • Explore diverse data sources: Look beyond traditional data sources, and explore other sources of data, such as social media, Google Analytics, ED capacities at nearby hospitals and local General Practitioners if available. • Consider using ensemble methods: Model forecasts using ensemble methods, such as combining multiple algorithms or models, to improve forecasting accuracy. • Evaluate and detect concept drift: Simple concept drift methods that track the error rate of a model can be applied for this purpose where drift is detected when the error rate increases significantly compared to a minimum observed error rate. Confidence intervals can be used to determine if a significant increase in error rate has occurred. • Anticipate and mitigate for concept drift: Stay ahead of some causes of concept drift which can come in the form of changes in nearby hospital operations of their EDs, competition from new UCCs and reforms in healthcare policies concerning subsidies. Concept drifts can also accompany infrastructure and transportation changes as well as economic downturns. While these factors can lead to fluctuations in patient arrival patterns and demand for services, exploratory modelling using additional flag variables that attempt to represent these events may act to preserve forecasting accuracy.

Conclusion
Patient flows in Urgent Care Clinics (UCCs) and Emergency Departments (EDs) have been experiencing increasing pressure as the volumes have grown and have become less predictable using ad hoc approaches. The ability to accurately forecast patient arrivals in these contexts, and to understand better what some of the drivers of demand are, is important for the efficient and effective functioning of these healthcare providers by enabling them to respond faster and achieve more optimal human resourcing. There are many factors which affect patient arrivals and randomness accounts for some of this. Autoregressive variables as well as calendar and meteorological indicators have already been explored in prior research and have been found useful for estimating patient flows. This study goes further by also considering additional quasi-real-time variables like Google search terms, pedestrian traffic and prevailing incidence levels of influenza, alongside COVID-19 Alert Level indicators to improve the forecast accuracy.
This study makes a unique contribution in several respects. Not only does this study integrate a wide variety of new features for forecasting patient flows, but it also considers both the effects that the concept drift triggered by the recent pandemic conditions had on the forecast accuracies, and the means by which the models can rapidly adapt to the changing context in order to learn and recover. This research also ventured beyond the standard studies using machine learning methods in this domain, by utilising tools from the eXplainable AI field in order to expose the internals of the models, thus achieving both the high-level interpretability of the models and the explainability of their individual forecasts.
Our research determined that the Voting ensemble-based method performed most reliably in this setting and the final accuracies were competitive with those from prior studies. While the autoregressive and calendar features were important, the experiments indicated that the prevailing COVID-19 Alert Level feature together Google search terms and pedestrian traffic were effective at generating accurate forecasts. Tables 8 and 9 show the improvements achieved by the top seven proposed methods using all features over the benchmark as a percentage of MAPE. The results are broken down by year.

A.2 TheilU statistical test on the significance of model improvements over the benchmark
Tables 10 and 11 display the results of the TheilU statistical test showing the detailed improvements attained by the top seven proposed methods using all features over the benchmark. The test is performed across all years as a summary as well as broken down by year     Table 12 shows the detailed MAPE accuracies of the proxy-only models for Clinic 1 contrasted with univariate ARIMA and Prophet models. The univariate models outperform the proxy-only models in the period leading up to the pandemic, but subsequently the Voting proxy-only models shows general improvements over the univariate methods as the concept drift takes effect from 2020 onward.